STA141B Final Project: LendingClub Loan Analysis

Yuhan Dai

This is a simple guideline of the project, you can click the following link to get to the particular section and return to the guideline with the go back button.

  1. Part 1: Introduction
  2. Part 2: Statistical Visualization
  3. Part 3: Natrual Language Processing (NLP) Analysis of the Loan Application Description
  4. Part 4: API and Web Scraping of Loan Amount and Inerest Rate
  5. Part 5: Non-Parametric Model Construction
  6. Part 6: Conclusion

Part 0: Setup Google Drive Environment

Part 1: Introduction

1.a Motivation

The financial market is channeling funds from people who save surplus money to those that have a shortage of money. According to figure, in the financial system, there are two ways to make the deal. One is direct finance, which is buying financial instruments directly from lenders. The other one is through financial intermediaries like banks, which are called indirect finance. The advantage of indirect finance is that financial intermediaries can help lower transaction costs and deal with asymmetric information. Therefore, the main function of the financial intermediaries is to select less risky borrowers from the existing information before the transaction and ensure the borrower will repay the loan after the transaction. LendingClub is a bank holding company providing loan services. Although the loan has higher risk and lowers liquidity, 60% of the fund in the bank will be loaned out. Because we all know that higher risks and high returns. Therefore, the services provided by LendingClub must be the major product of the bank. Our main goal is to help the company analyze the existing information of the borrowers to avoid adverse selection which means to avoid selecting the risky borrower with the key question: what are the main factors that can affect the risk level of a borrower? financial%20system.jpg

                            Figure 1: the financial market system

1.b Summary of Raw Dataset

The dataset contains 2,260,701 rows, which is relatively huge. Since we want to predict whether the borrower has the ability to repay the loan, we use the loan status as the independent variable. Also, we only focus on the two statuses of the loan: fully paid and charged off, which represent being able to repay and unable to repay respectively. Thus, only 1,345,310 rows are available for us. Then, we select the dependent variables that might be significant. By understanding the concept of loans, we could screen out 15 potential numeric variables including the loan amount, the fund amount, the interest, the monthly payment, the annual income, the ratio of debt to income, and so on. In addition, we screen out 9 categorical variables that might be significant to predict the loan status including the term, the grade, the homeownership, the purpose, the location, and so on.

Part 2: Statistical Visualization

2.a Correlation between Numeric Features

By performing a heatmap using seaborn for the correlation analysis between numeric features (figure 2), we observe that there are several groups of variables that have relatively high correlations. The loan amount (loan_amount) has a high correlation with the installment (installment); it is reasonable that the borrowers with installment loans have a relatively high amount of loan. Besides, the payments received to date for the total amount funded (total_pymnt) and the portion of the payment received to date of the total amount funded by investors (total_pymnt_inv) have high intercorrelation. The number of open credit lines (open_acc) and the total number of open credit currently in the borrower’s credit file (total_acc) are also highly correlated.

Figure 2: correlation between numeric features

2.b Barplot of Purpose

By constructing a barplot of the different purposes of loans (figure 3) using matplotlib, we could see that the Top 2 purposes of individuals applying for loans are credit cards and debt consolidation. The purpose of debt consolidation shows an absolute leading trend with nearly 80,000 counts, while the second-largest purpose credit card consists of approximately 30,000 counts. There is an interesting fact that the auto loan only plays a small part in the total applications. It’s about the same amount as the people who need the loan for vacation plus moving.

                                Figure 3: barplot of purpose 

2.c Scatter Plot of Loan Amount and Interest Rate By Grade

According to the scatter plot (figure 4), we can see that the interest rate increases as the ordinal grade from A to G increases, which means a higher grade has a negative trend in the interest rate. This might influence the ability of a borrower to pay back the loan. However, in the violin chart (figure 5), it is obvious to see that the loan amount applied by the borrower with a higher grade tends to apply for a larger amount of loan.

Figure 4: scatterplot of loan amount and interest rate by grade
Figure 5: violin chart of loan amount by grade

2.d Home Ownership

We then construct the pie chart (figure 6) of the top 3 major homeownership and the barplot of the loan status by homeownership (figure 7) using matplotlib. The pie chart shows that around half of the homeownership is owned by taking mortgages. In addition, among the rest of the people, the number of people who rent their home is about 3 times more than the people who own their home. The barplot here shows the proportion of the loan status in different homeownership. The people who buy their own houses have the highest proportion of the people

Figure 6: Pie chart of major homeownership    
Figure 7: barplot of loan status by homeownership 

2.e Loan Amount of State

Below is a map (figure 8) that shows the total amount of loans in the United States issued by states. According to the map, California has the highest amount of loans. Besides, New York, Florida, and Texas also have large-scale issued loans. Those are the top four states of the loan amount, which is much larger than the other states.

                                    Figure 8: loan amount distribution map 

2.f Loan Status and Categorical Features

We plot four countplots of loan status by some potential variables that we think are significant in the loan status prediction (figure 9). The top left graph shows that the longer the years of employment, the higher the proportion for the borrower to fully repay the loan. The top right graph shows that the loan status for purpose seems to have the same distribution. The bottom left graph shows that the borrowers with higher grades are more willing to fully repay the loan. The bottom right graph shows that loans with shorter terms have more proportion in the fully paid in loan status, which means long-term loans have a higher risk because of the unpredictable future.

                            Figure 9: countplots of loan status by four potential categorical features

Part 3: NLP Analysis of the Loan Application Description

To understand more about the borrowers, we extract and analyze the description of the loan application. We collect all the short paragraphs left by the borrowers and use the standardization techniques of natural language processing including word tokenization, lowercasing, lemmatization, and stop words removal. Then, we plot the distribution of the term frequency. According to figure 10, we can conclude that the most main purpose of the loan application in LendingClub is the monthly credit card payment and the debt consolidation.

                                Figure 10: frequency distribution for tokens

Part 4: API and Web Scraping of Loan Amount and Inerest Rate

4.a API Accessing with key

We used API accessing with keys to extract two data sets which are the monthly loans and leases of commercial banks and the loan rate or the monthly average loan rate from 2007 to 2018 in the Nasdaq Data Link: https://data.nasdaq.com/data/FRED-federal-reserve-economic-data. We convert the dates to periods and construct two data frames for each of the two data sets. Then, we plot the data using matplotlib. In figures 11 and 12, we can see that the rapid increase in loan amounts and the big drop in the interest rate continued until 2016 because the central bank decided to increase the interest rate to fight against inflation from December 2015. Therefore, the increase in loan amount is slowing down and the interest rate has a small increase after that.

      Figure 11: commercial banks monthly loans amounts              Figure 12: loan rate of commercial banks

How do the financial crisis and the monetary policy affect the loan amount and the interest rate? According to the New York Times, in the financial crisis of 2008, the central government purchased large-scale long-term bonds from the public to increase the demand of the bond market. According to figure 13, as the demand increased, the bond price increased and the interest rate decreased since the price of the bond is negatively related to the interest rate. In addition, since the central bank purchased a large amount of loan, the loan went back to the government and the money was flowing into the market. Thus, we can see in figure 14, the interest rate decreased as the money supply increased. However, as the interest rate decreased, the cost of borrowing decreased, and more loans were issued by the commercial banks. Therefore, we can see that in figures 11 and 12, the interest rate kept decreasing between 2008 and 2016, which led to an increase in loan amount during this period.

123131313.png

    Figure 13: bond market                    Figure 14: money market                     Figure 15: whole market

4.b Web Scraping

Using web scraping techniques, we got the historical mortgage rates in the US for the last 11 years in Value Penguin: https://www.valuepenguin.com/mortgages/historical-mortgage-rates. The mortgage rate is also a kind of interest rate of the mortgage loan. We also construct the data frame and plot the data for this data set we drown above. Since a mortgage loan is also a long-term loan, it is a loan included in the expansionary monetary policy called quantitative easing. The government bought long-term bonds including mortgage loans to help the economy during the financial crisis to lower the interest rate in 2008. We can see that from the time series plot, the interest rate of mortgage loans kept getting lower until 2016 the government started to tighten the monetary policy. Then, the mortgage rate went up. We are where we were in 2016, the bank should take action to respond to the increase of the interest rate of loans.

According to figure 16, we can see that since the aggregate demand increased, the inflation rate (π) kept increasing. Inflation would lead to serious unemployment and price hike problems. As a result, the government used quantitative tightening to increase the interest rate, and the situation reversed after 2016, which matches the change in figures 11 and 12. In the coronavirus financial crisis in the past two and half years, the central bank used the same strategy to stimulate the economy. Therefore, the inflation rate increased a lot. The central bank said that the tight monetary policy will happen from March 2022, which means the interest rate will increase in the short future. In that case, the demand for the loan will decrease since the cost of borrowing will increase. Thus, the company should prepare for the change. For example, the company can ease the term of borrowing to minimize the loss of the price-sensitive customers.

              Figure 16: historical mortgage rates

Part 5: Non-Parametric Model Construction

5.a Data Preprocessing

In the total 24 features, only four dependent features have missing values. We construct a table of the proportion of the missing values for each variable. In the table, the feature employment length in years has the highest proportion of missing values, but 5.8359% is not also small. Due to the large data size, we decide to drop all rows with null values.

We convert all the categorical values into binary and multi-binary variables. For numeric values, we drop four dependent variables that have high correlations with other variables including the monthly payment (installment), the total amount funded (total_pymnt), the portion of the payment received to date of the total amount funded by investors (total_pymnt_inv), and the total number of open credit currently in the borrower’s credit file (total_acc) based on the correlation heatmap we construct in the visualization section. Then, we split the dataset into two groups, 80% of the dataset as the training group and 20% as the testing group, for model training and model testing.

5.b Model Training

The independent variable is not normally distributed so we need to use a non-parametric model for this dataset. According to figure 17, we think the random forest model that can handle skewness is an appropriate model. We train the random forest model with the training dataset and use the fitted model to predict the independent variable loan status with the testing group of the dependent variables. Then, we compare the predicted independent variable with the testing group of the independent variable. There is an 81% accuracy, which means the fitness of the model is relatively high.

     Figure 17: distribution of predict variable 

For logistic regression, the predicted variable is binary and we have already removed the highly correlated features so we meet the assumption of the logistic regression. We also fit the logistic model with the training group and compare the independent variables in the testing group with the predicted variable of the model. The accuracy is also about 81%.

5.c Model Selection

Since both the random forest model and the logistic model show to have the relatively same accuracy, we try to use cross-validation to select one model to fit with the whole dataset. According to the cross-validation, the random forest model tends to have 80.61% accuracy and the logistic regression model has 80.52% accuracy. Thus, we choose to use a random forest model for the following analysis.

5.d Model Implement

We fit a random forest model with the whole dataset and calculate the important score of every dependent variable. We then sort the scores in descending order and plot them with a barplot. According to figure 18, the ratio of debt to income (dti), the credit balance (revol_bal), and the interest rate (int_rate) are 3 of the most significant features. Also, it is interesting to know that the month the borrower opens the account (earlist_cr_line) and the location of the borrower (addr_state) also have a relatively high important score in predicting the ability to repay the loan.

                    Figure 18: barplot of feature significant score

Part 6: Conclusion

In conclusion, if LendingClub wants to provide a more precise prediction of the risk level of the borrowers for the customers, the company can focus more on the proportion of debt to the income and the credit balance of the borrower. Also, since the interest rate is the third most important feature and the interest rate will be increased based on the API analysis we did, the company should pay more attention to the monetary policy announced by the Federal Reserve.

Although the accuracy of the random forest model is higher than the logistic model and we use the random forest model to compare the significance of the dependent features, the difference in the accuracy of the two models is nuanced. For the future study, we can use the $L_1$ regulation to avoid the conservation of the feature selection since $L_1$ may directly set the coefficient of the dependent variable into zero if it evaluates the variable as not significant.

Reference

Ceizyk, D. (2022, February 25). Historical mortgage rates: Averages and trends from the 1970s to 2019. ValuePenguin. Retrieved March 13, 2022, from https://www.valuepenguin.com/mortgages/historical-mortgage-rates

Economist's view. Economist's View: The Challenge of Sovereign Wealth Funds. (n.d.). Retrieved March 13, 2022, from https://economistsview.typepad.com/economistsview/2008/01/the-challenge-o.html

George, N. (2019, April 10). All lending club loan data. Kaggle. Retrieved March 13, 2022, from https://www.kaggle.com/wordsforthewise/lending-club

Nasdaq Data Link. data.nasdaq.com. (n.d.). Retrieved March 13, 2022, from https://data.nasdaq.com/data/FRED-federal-reserve-economic-data

Sommer, J. (2022, January 28). What may be in store as the Fed cuts back on the easy money. The New York Times. Retrieved March 13, 2022, from https://www.nytimes.com/2022/01/28/business/fed-inflation-stocks-bonds.html